# Video-Text Contrastive Learning

Xclip Base Patch16 Zero Shot
MIT
X-CLIP is a minimalist extension of CLIP for general video-language understanding, trained via contrastive learning to match videos and texts.
Text-to-Video Transformers English
aurelio-ai
22
1
Xclip Base Patch16 Kinetics 600 16 Frames
MIT
X-CLIP is an extension of CLIP for general video-language understanding, supporting zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.
Text-to-Video Transformers English
microsoft
393
2
Xclip Base Patch16 Kinetics 600
MIT
X-CLIP is an extended version of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Text-to-Video Transformers English
microsoft
294
1
Xclip Base Patch16 Ucf 4 Shot
MIT
X-CLIP is a minimal extension of CLIP for general video-language understanding, trained via contrastive learning with (video, text) pairs.
Video Processing Transformers English
microsoft
16
0
Xclip Base Patch16 Ucf 2 Shot
MIT
X-CLIP is a minimalist extension of CLIP for general video-language understanding. The model is trained on (video, text) pairs through contrastive learning.
Text-to-Video Transformers English
microsoft
51
1
Xclip Base Patch16 Hmdb 8 Shot
MIT
X-CLIP is an extended version of CLIP for general video-language understanding, trained through contrastive learning on video-text pairs, suitable for video classification and video-text retrieval tasks.
Text-to-Video Transformers English
microsoft
17
1
Xclip Large Patch14 16 Frames
MIT
X-CLIP is an extension of CLIP for general video-language understanding, trained via contrastive learning and suitable for video classification and video-text retrieval tasks.
Text-to-Video Transformers English
microsoft
678
3
Xclip Large Patch14
MIT
X-CLIP is an extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Text-to-Video Transformers English
microsoft
1,698
11
Xclip Base Patch16 16 Frames
MIT
X-CLIP is a minimalist extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.
Text-to-Video Transformers English
microsoft
1,034
0
Xclip Base Patch32 16 Frames
MIT
X-CLIP is an extended version of CLIP for general video-language understanding, trained on video-text pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.
Text-to-Video Transformers English
microsoft
901
4
Xclip Base Patch32
MIT
X-CLIP is an extended version of CLIP for general video-language understanding, trained on (video, text) pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.
Text-to-Video Transformers English
microsoft
309.80k
84
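
The model descriptions above all refer to contrastive learning on (video, text) pairs. As a rough illustration of that idea, the following is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is not X-CLIP's actual training code: the function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are assumptions chosen for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(video_embs, text_embs, temperature=0.07):
    """Cosine similarity of every video with every text, divided by a temperature.

    The temperature value is an assumption for this sketch.
    """
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of the video-to-text and text-to-video cross-entropy terms."""
    logits = similarity_matrix(video_embs, text_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy embeddings (hypothetical): video i is meant to match text i.
videos = [[1.0, 0.1], [0.0, 1.0]]
texts = [[0.9, 0.0], [0.1, 1.0]]

aligned_loss = contrastive_loss(videos, texts)
shuffled_loss = contrastive_loss(videos, texts[::-1])  # deliberately mismatched pairs
print(aligned_loss < shuffled_loss)
```

With correctly paired embeddings the loss is lower than with the pairs swapped, which is the signal that pulls matching videos and texts together during training. The same similarity scores can be used at inference time to rank candidate text labels for a video, which is how zero-shot classification and retrieval are typically done with CLIP-style models.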